
Add interface to launch parallel dygraph by multiprocessing #26044

Merged

Conversation

@chenwhql chenwhql commented Aug 7, 2020

PR types

New features

PR changes

APIs

Describe

This PR adds the multiprocessing start methods start_processes and spawn for dygraph data parallel training.

1. Start method difference

  • start by launch

python -m paddle.distributed.launch --selected_gpus=0,1 train.py

  • start by spawn

python train.py

and call spawn inside the __main__ block, for example (a fuller self-contained sketch follows the snippet):

paddle.distributed.spawn(train_mnist,
    args=(args,),
    nprocs=args.nprocs,
    join=True)
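
Putting the two together, a minimal sketch of a spawn-mode entry script (train_mnist and the --nprocs flag are placeholders for illustration, not code from this PR):

import argparse

import paddle.distributed as dist


def train_mnist(args):
    # Placeholder for the per-process training loop; it would call
    # dist.init_parallel_env() before building the model.
    pass


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--nprocs', type=int, default=2)
    args = parser.parse_args()
    # spawn starts nprocs worker processes running train_mnist(args)
    # and, with join=True, blocks until all of them exit.
    dist.spawn(train_mnist, args=(args,), nprocs=args.nprocs, join=True)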

2. Simple example

from __future__ import print_function

import paddle
import paddle.nn as nn
import paddle.optimizer as opt
import paddle.distributed as dist

class LinearNet(nn.Layer):
    def __init__(self):
        super(LinearNet, self).__init__()
        self._linear1 = nn.Linear(10, 10)
        self._linear2 = nn.Linear(10, 1)
        
    def forward(self, x):
        return self._linear2(self._linear1(x))

def train(print_result=False):
    # 1. enable dynamic mode
    paddle.disable_static()
    
    # 2. initialize parallel environment
    dist.init_parallel_env()

    # 3. create data parallel layer & optimizer
    layer = LinearNet()
    dp_layer = paddle.DataParallel(layer)

    loss_fn = nn.MSELoss()
    adam = opt.Adam(
        learning_rate=0.001, parameters=dp_layer.parameters())

    # 4. run layer
    inputs = paddle.randn([10, 10], 'float32')
    outputs = dp_layer(inputs)
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(outputs, labels)
    
    if print_result is True:
        print("loss:", loss.numpy())
    
    loss = dp_layer.scale_loss(loss)
    loss.backward()
    dp_layer.apply_collective_grads()

    adam.step()
    adam.clear_grad()

# Usage 1: only pass the function.
# If your training method does not need any arguments and
# uses all visible devices for parallel training.
if __name__ == '__main__':
    dist.spawn(train)

# Usage 2: pass the function and arguments.
# If your training method needs some arguments and
# uses all visible devices for parallel training.
if __name__ == '__main__':
    dist.spawn(train, args=(True,))

# Usage 3: pass the function, arguments and nprocs.
# If your training method needs some arguments and
# only uses part of the visible devices for parallel training.
# If your machine holds 8 cards {0,1,2,3,4,5,6,7},
# this case will use cards {0,1}; if you set
# CUDA_VISIBLE_DEVICES=4,5,6,7, this case will use
# cards {4,5}.
if __name__ == '__main__':
    dist.spawn(train, args=(True,), nprocs=2)

# Usage 4: pass the function, arguments, nprocs and selected_gpus.
# If your training method needs some arguments, you only want to
# use part of the visible devices for parallel training, and you
# can't set your machine's environment variable
# CUDA_VISIBLE_DEVICES (for example, it is unset or lists all cards
# {0,1,2,3,4,5,6,7}), you can pass `selected_gpus` to
# select the GPU cards you want to use. For example,
# this case will use cards {4,5} if your machine holds 8 cards.
if __name__ == '__main__':
    dist.spawn(train, args=(True,), nprocs=2, selected_gpus='4,5')

3. API change

  • Add 4 new APIs:

    • paddle.distributed.spawn: start multi-process training by the spawn method
    • paddle.distributed.init_parallel_env: initialize parallel environment variables & get the parallel strategy
    • paddle.distributed.get_rank: get the current process rank
    • paddle.distributed.get_world_size: get the current world size (see the sketch after this list)
  • Move 2 old APIs:

    • paddle.prepare_context (fluid.dygraph.prepare_context) -> paddle.distributed.prepare_context
    • paddle.ParallelEnv (fluid.dygraph.ParallelEnv) -> paddle.distributed.ParallelEnv
  • Refine 1 old API:

    • paddle.DataParallel (fluid.dygraph.DataParallel): strategy becomes an optional argument
  • Deprecate 1 old API:

    • paddle.distributed.prepare_context (fluid.dygraph.prepare_context): to be replaced by paddle.distributed.init_parallel_env later
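
A hedged sketch (not code from this PR) of how the two query APIs could be used inside a training function, assuming init_parallel_env has already been called in each worker:

import paddle.distributed as dist


def log_process_info():
    # get_rank() returns the rank of the current process and
    # get_world_size() returns the total number of parallel processes;
    # both assume dist.init_parallel_env() has already been called.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    print("worker {} of {}".format(rank, world_size))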

4. Correctness

Verified the correctness of the interface with the following models:

  • Mnist: test_parallel_dygraph_mnist.py
  • SeResNext: test_parallel_dygraph_se_resnext.py
  • Transformer: test_parallel_dygraph_transformer.py

5. Related docs

(Five screenshots of the related API documentation were attached here.)

@gongweibao gongweibao (Contributor) left a comment

It would be best to have a performance comparison for the spawn mode?

@guru4elephant guru4elephant (Member) left a comment

LGTM

ParallelStrategy = core.ParallelStrategy


def init_parallel_env(backend='nccl'):

A reviewer (Member) commented on this line:

NCCL is an underlying communication library; I don't think it's necessary to let users know we have different backends here. If we want to support operating systems such as Windows that don't support NCCL, it's better to detect the operating system inside the init function and use another communication library, such as gloo. I highly recommend removing the backend argument for now, for simplicity of usage.

@chenwhql (Contributor, Author) replied:

Thanks, I think it is okay to remove it; we can discuss removing this argument via a cherry-pick.
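
As an aside, a minimal sketch of the backend auto-detection the reviewer describes (not part of this PR; the helper name _select_backend is hypothetical):

import sys


def _select_backend():
    # NCCL is only available on Linux, so fall back to gloo elsewhere;
    # this way users never need to pass a backend argument themselves.
    if sys.platform.startswith('linux'):
        return 'nccl'
    return 'gloo'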

@guru4elephant guru4elephant self-requested a review August 28, 2020 05:11
@chenwhql chenwhql closed this Aug 28, 2020
@chenwhql chenwhql reopened this Aug 28, 2020

@guru4elephant guru4elephant (Member) left a comment

please remove the backend argument for simplicity

@chenwhql (Contributor, Author) replied, quoting @gongweibao:

It would be best to have a performance comparison for the spawn mode?

Thanks for the feedback; there indeed should be one. May I follow up with a report later? The development time for this interface was a bit short: for most of the past week we were discussing and iterating on the interface design, and it has to ship with the 2.0-beta release, so only correctness has been verified and the performance comparison has not been done yet.

In theory this interface is no different from launch; it only changes how the multiple processes are started, without adding any extra implementation, so there should be no performance difference. It is also just an optional startup method and does not affect the original usage of launch.

@XiaoguangHu01 XiaoguangHu01 (Contributor) left a comment

LGTM

@jzhang533 jzhang533 (Contributor) left a comment

lgtm

@chenwhql chenwhql requested a review from kolinwei August 28, 2020 06:12

@raindrops2sea raindrops2sea (Collaborator) left a comment

LGTM
